Final Report

Author

Stolie Erickson, Mira Shlimenzon, Joseph Konat, Casey Avila

1 Analyzing the Relationship Between Average Daily Income Per Capita and Average CO2 Emissions Per Capita.

Source: Washington Post

In this report, we will analyze the relationship between average daily income per capita and average CO2 emissions per capita. As technology continues to evolve and countries engage in trade to minimize their opportunity costs, CO2 emissions have skyrocketed in recent years. With the importance and prevalence of managing climate change and the influence that CO2 emissions have over its progression, we hope to find a link between these variables in order to obtain information regarding alternative explanations for the rise in carbon emissions.

1.1 Data Description

The two variables that we are interested in are the average daily income per capita (explanatory variable) and how it is related to CO2 emissions per capita (response variable).

1.1.1 Income Dataset

The income dataset contains information regarding each country’s average daily household income based on a 2011 survey. These are measured in constant international dollars (PPP), converting the dollars to amounts that equalize the purchasing power across different currencies/regions. Our data cleaning will reveal more information regarding missing values and numbers that are replacing missing data. On Gapminder, this dataset is titled “Average daily income per capita, PPP (constant 2011 international $)”.

1.1.2 CO2 Dataset

The CO2 data set describes the consumption based CO2 emissions in million tonnes of CO2 per capita. These values are based on the Global Carbon Project 2021. On Gapminder, this dataset is titled “CO2 emissions per capita (Consumption based)”.

Note: This is all of the available information for this dataset on Gapminder.

1.2 Hypothesis

According to additional research, we anticipate that more economically developed countries (those with higher average daily incomes per capita) will be correlated with higher average CO2 emissions compared to countries with lower average daily incomes per capita (Center for Global Development).

1.3 Data Cleaning

1.3.1 CO2 Dataset

Upon initially looking through the CO2 data, we noticed that less economically developed countries in earlier years have repeated emission values for extended periods of time. We believe that it is unreasonable for CO2 emissions per capita to remain exactly the same for 30+ years for several rows. Under the assumption that predictive modeling was used to fill in data which was unknown dating back to 1800, we will be taking steps to clean it. In order to prevent our data from skewing, we are creating a function that will take in each row of the data set, and replace the emission values with NA’s until the emission initially changes. For example, since Australia has an emission value of 0 from 1800 to 1859, then changes in 1860 to .33, our function will sequence through the row and replace all of the 0’s with NA until 1859 (it will keep only the last 0 before the change) then the rest of the row will remain unchanged. Additionally, we noticed how the minus signs of the CO2 emissions per capita were not actual minus signs, so we replaced them accordingly.

1.3.2 Income Dataset

This dataset suffered from a similar problem to that of the CO2 dataset, so we can re-use our cleaning function on our average daily income dataset.

1.4 Final Cleanup and Pivoting

1.4.1 CO2 Dataset

Our CO2 dataset also included negative values, which were encoded as strings due to them using a “−”, rather than “-” (note the slight difference). This was fixed by replacing the dash with a traditional hyphen so that we could convert the string to a numeric value. Additionally—rather than cleaning out these values—we assumed that the negative CO2 emission valuescould potentially be from countries that actually filter more CO2 out of the atmosphere than they emit, for example tree/vegetation dense areas that are able to remove large amounts of carbon dioxide.

1.4.2 Income Dataset

This data set included values from the future (2023 - 2050), which were predicted based off of GDP growth rates for a given country. We removed all values after 2020 in order to match our CO2 dataset and make joining simpler. We also wanted to avoid any predictive modeling that the income dataset may have used in order to obtain values for years that have not already occurred.

1.5 Joining the Data

While merging our data, we wanted to ensure that both the explanatory and response variables remained grouped by both the country for which the data was collected and the year. This will ensure that if we want to analyze specific countries then we can group by that aspect, similarly with the year.

We decided to drop all of the rows that had missing values for either average emissions or average income in the merged data set. This is because with regards to correlation and linear regression, it is generally recommended to remove rows with missing values. When there are missing values in either the predictor or the response variables it can create inconsistencies in the data. This could potentially lead to biased or inaccurate results. Although lm and ggplot automatically drop the NA values, if we plan on doing further summary analysis we want to use the data that is demonstrated in our models.

2 Linear Regression

2.1 Data Visualization

2.1.1 Relationship Between Variables

This graph demonstrates the relationship between average income per capita (x) and CO2 emissions per capita (y). Each point on this graph corresponds to a country and the year in which these metrics were measured. From the graph, we can see that the points appear to follow a positive, somewhat linear correlation (increases in average daily income are correlated with increases in CO2 emissions).

2.1.2 Relationship Over Time

In the above graph, we have displayed average income per capita vs emissions per capita for six different years: 1920, 1940, 1960, 1980, 2000, and 2020. We Included these intervals and faceted each graph so that the evolution of the relationship between income and relationship can be more easily seen. The interval of 1920 - 2020 was used due to the higher number of countries with both emissions and income data during these time periods. We used six different years specifically because we felt it gave the most clear and easy to visualize comparison between different time periods.

There appears to be a positive correlation between income and CO2 emissions. Over time, average income per capita has dramatically increased. Since 1960 there appears to be a stronger linear relationship between each variable, with an increase in both income and emissions per capita.

2.1.2.1 Animated Visualization

This visualization shows the relationship between our variables with the year animated, clearly showing the increase both average daily income and CO2 emissions over time.

From the animated visualization, we can see that as the year increases, the points become more variable. In the earlier years (1800 until about 1950) the points mainly remain clustered close to 0. However, as the animation progresses to 1950 and beyond, there are far more points, which are also much more spread out We see the fan shape of the data extend to exponentially larger average income per capita and average CO2 emission per capita values. There are several explanations for this, but the majority of our research attributes the pattern to and increase in fossil fuel burning for electricity and steel, along with oil for vehicles and manufacturing and technology skyrocketed (NOAA).

2.2 Linear Regression Model

Linear regression is a technique for modelling the relationship between two statistical variables (average daily income per capita and average CO2 emissions per capita). It assumes that there is a linear relationship between these variables, and by calculating their mathematical relationship, we can gain insight on how these variables are related as well as how to make predictions. For this dataset, we developed a linear regression model with our explanatory variable as average daily income per capita, and our response variable as average carbon dioxide emissions per capita. This will provide us with valuable information that will test our hypothesis as well as allow us to predict emission levels for given income levels.

For wrangling our data to use for the linear regression model, we decided to summarize our data so we have one x and y value for each country. However, due to the increase in both emissions and income in recent decades, we decided to take the average from 1990 to 2020. We also considered making several intervals of 30 years to increase our data availability, but decided that ultimately pulling data from more recent years in which emissions increased significantly would be the most insightful.


Call:
lm(formula = average_emission ~ average_income, data = linear_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7604 -1.2931 -0.7722  1.2543 16.9036 

Coefficients:
               Estimate Std. Error t value            Pr(>|t|)    
(Intercept)     0.50481    0.43040   1.173               0.243    
average_income  0.27777    0.01448  19.188 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.322 on 118 degrees of freedom
Multiple R-squared:  0.7573,    Adjusted R-squared:  0.7552 
F-statistic: 368.2 on 1 and 118 DF,  p-value: < 0.00000000000000022

This graph demonstrates the linear regression for average income per capita (explanatory), and average CO2 emissions per capita (response) from data timing across the years 1990-2020. Each country has an average income and emission value over the entirety of the 30 years, which is the time frame that contains the most recent and reliable numeric values. From the graph, we can see that our line of best fit using the ordinary least squares (OLS) closely follows the data points for the majority of our data. However, we can see that the points become more variable as they move away from the origin, which could affect the accuracy of our predictions.

2.2.1 Model Interpretation

\[average\_emission = 0.505 + 0.278*average\_income\] Where \(average\_emission\) is the emissions per capita of a country, measured in millions of tonnes of CO2, and \(average\_income\) is the average daily income per capita in constant international dollars.

Interpretation of the Intercept: For countries with an average income per capita of 0 constant international dollars, the estimated CO2 emissions per capita is .505 million tonnes.

Interpretation of \(average\_income\): Each increase in one constant international dollar of income per capita is associated with an increase of .278 million tonnes in CO2 emissions per capita for all countries similar to those in the study.

2.2.2 Variance Table

Variability Information
Variance Type Variance
Response Values 45.10085
Fitted Values 34.15482
Residuals 10.94603

2.2.3 Description of Model Fit

The variance (the distribution of a set of data points) of the response values showcases how average emission varies 45.10085 per capita over time. The variance of the fitted values highlights how much the values predicted by the model are distributed, resulting in a variation of 34.15482. The variance of the residuals describes the spread of the difference between the observed values and their corresponding fitted values. This discusses the variability the model doesn’t account for, which is 10.94603.

\[ 1 - \frac{residual\_variability}{response\_variability} = 1 - \frac{10.95}{45.10} = 0.7573\]

75.73% of the variation in average CO2 emissions per capita is explained by the linear regression model, using average income per capita as the predictor variable. This \(R^2\) value shows that our model is of reasonable quality, as it shows a moderate amount of correlation. Ideally, our proportion would be even higher (0.9 to 0.95) in order to make precise predictions, but our value of 0.75 is likely enough to establish a positive correlation between average income per capita and average CO2 emissions per capita.

3 Simulation

By using our linear regression equation, we can predict values that were not directly collected in the dataset. This can be done by plugging in a value for \(average\_income\) in our regression equation, which would result in a theoretical \(average\_emission\) value. Following from this, we can create a simulated dataset with each value calculated entirely from our regression model. To account for the fact that not every value falls precisely on the regression line, we add a random, normally-distributed amount of noise to each data point that matches our original dataset.

3.1 Visualizing Simulated Data

From the plots above, we can see that the actual data and simulated data both have positive, somewhat linear correlations. However, the actual data appears to have a fan shape, where the points near the origin are clustered very close together, and then as the x-value increases the points spread out. Comparatively, the simulated data clusters upward towards the y-axis from about 5-10 million tonnes of CO2, and is a thicker, more consistent cluster of points as a whole rather than a fan shape. This is because the noise we added in our simulated dataset is normally distributed and uniform across all incomes, while it is clear that the randomness of our observed dataset incomes scale linearly with income (as income increases, the “range” of random variation increases). This also explains why our average emissions are more normally-distributed as opposed to being extremely right-skewed as shown in the first visualizations.

This graph displays our simulated emissions data against our observed data. The dashed line at \(y = x\) represents an exact match between the two datasets. It is visible that the two datasets follow the slope of our fitting line. However, this graph also presents a difference between the two datasets. As discussed above, the simulated dataset has the same amount of noise regardless of the income amount (and thereby emission amount), while our observed data has less noise when values are low, and more when they are high. This is displayed in that low emission values are vertically dense (in the observed data), and horizontally spread out (in the simulated data).

3.2 Generating Multiple Predictive Checks

In order to test our linear regression model, we can generate multiple simulated datasets, regress this new data on our observed data, and look at the \(R^2\) value from this regression.

Below is a plot of the distribution of \(R^2\) values from 1500 simulated datasets.

In this plot, we see that the simulated datsets have \(R^2\) values between 0.45 and 0.7. This indicates that the simulated data is moderately similar to our observed data. On average, our simulated data accounts for around 56.1% of the variability in the observed data set.

4 Conclusion

Throughout our research, along with development of several visualizations, we can conclude that there is a clear positive relationship between average daily income per capita and average CO2 emissions per capita. Most notably, our linear regression proved that average daily income per capita is a significant predictor (p-value <.000) for average CO2 emissions per capita. The data we investigated was for countries around the world, over several different time frames. Ultimately, performing a multiple regression and introducing alternative explanatory variables could provide additional insight on factors that might have a higher correlation and direct influence over CO2 emissions per capita.

5 Works Cited

“A Grammar of Animated Graphics.” A Grammar of Animated Graphics •, gganimate.com/. Accessed 7 June 2023.

“‘Average Daily Income per Capita, PPP (Constant 2011 International $)’,‘CO2 Emissions Per Capita (Consumption Based).’” Gapminder, www.gapminder.org/data/. Accessed 7 June 2023.

“Carbon Dioxide Levels Race Past Troubling Milestone.” National Oceanic and Atmospheric Administration, www.noaa.gov/stories/carbon-dioxide-levels-race-past-troubling-milestone#:~:text=That%20all%20changed%20starting%20in,being%20pumped%20into%20the%20atmosphere. Accessed 12 June 2023.

“Developed Countries Are Responsible for 79 Percent of Historical Carbon Emissions.” Center For Global Development | Ideas to Action, www.cgdev.org/media/who-caused-climate-change-historically#:~:text=Developed%20Countries%20Are%20Responsible%20for,Global%20Development%20%7C%20Ideas%20to%20Action. Accessed 7 June 2023.

Mufson, Steven. “Coronavirus Is Driving down Global Carbon Dioxide Emissions to Levels Last Seen 10 Years Ago, Agency Says.” The Washington Post, 30 Apr. 2020, www.washingtonpost.com/climate-environment/2020/04/30/coronavirus-is-driving-down-global-carbon-dioxide-emissions-levels-last-seen-10-years-ago-agency-says/.

Ritchie, Hannah, et al. “CO2 Emissions.” Our World in Data, 11 May 2020, ourworldindata.org/co2-emissions.

“10  Predictive Checks.” Stat 331/531 Statistical Computing with R - 10  Predictive Checks, earobinson95.github.io/stat331-calpoly-text/10-predictive-checks.html#ch10-checkins. Accessed 7 June 2023.